EVOLUTION ON DISTRIBUTIVE LATTICES 



NIKO BEERENWINKEL*, NICHOLAS ERIKSSON, AND BERND STURMFELS 
DEPARTMENT OF MATHEMATICS 
UNIVERSITY OF CALIFORNIA 
BERKELEY, CA 94720, USA 
{NIKO,ERIKSSON,BERND}@MATH.BERKELEY.EDU 
•CORRESPONDING AUTHOR: 
PHONE; +1 (510) 642-3529, FAX: +1 (510) 642-8204 

Abstract. We consider the directed evolution of a population after an 
intervention that has significantly altered the underlying fitness land- 
scape. We model the space of genotypes as a distributive lattice; the 
fitness landscape is a real-valued function on that lattice. The risk of 
escape from intervention, i.e., the probability that the population devel- 
ops an escape mutant before extinction, is encoded in the risk polyno- 
mial. Tools from algebraic combinatorics are applied to compute the risk 
polynomial in terms of the fitness landscape. In an application to the 
development of drug resistance in HIV, we study the risk of viral escape 
from treatment with the protease inhibitors ritonavir and indinavir. 
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1. Introduction 

The evolutionary fate of a population is determined by the replication 
dynamics of the ensemble and by the reproductive success of its individuals. 
We are interested in scenarios where most individuals have a low fitness, 
eventually leading to extinction, and only a few types of individuals ("escape 
mutants") can survive permanently. These situations often arise due to 
a significant change of the underlying fitness landscape. For example, a 
virus that has been transmitted to a new host is confronted with a new 
immune response. Likewise, medical interventions such as radiation therapy, 
vaccination, or chemotherapy result in altered fitness landscapes for the 
targeted agents, which may be bacteria, viruses, or cancer cells. 

Given a population and such a hostile fitness landscape, the central ques- 
tion is whether the population will survive. In the case of medical interven- 
tions we wish to know the probability of successful treatment. Answering 
this question involves computing the risk of evolutionary escape, i.e., the 
probability that the population develops an escape mutant before extinction. 
We present a mathematical framework for computing such probabilities. 
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Figure 1. An event poset, its genotype lattice, and a fitness landscape. 

Our primary application is the evolution of drug resistance during treat- 
ment of HIV infected patients [Hj. We consider therapy with two different 
protease inhibitors (Pis). These compounds interfere with HIV particle 
maturation by inhibiting the viral protease enzyme. The effectiveness of PI 
therapy is limited by the development of drug resistance. Rapid and highly 
error prone replication of a large virus population generates mutants that 
resist the selective pressure of drug therapy. PI resistance is caused by mu- 
tations in the protease gene that reduce the binding affinity of the drug to 
the enzyme. These mutations have been shown to accumulate in a stepwise 
manner [5]. For most Pis, no single mutation confers a significant level of re- 
sistance, but multiple mutations are required for escape from drug pressure. 
Quantitative predictions of the probability of successful PI treatment would 
help in finding effective antiretroviral combination therapies. Selecting a 
drug combination amounts to controlling the viral fitness landscape. 

We regard the directed evolution of a population towards an escape state 
as a fluctuation on a fitness landscape. The space of genotypes is modeled 
as follows. We start with a finite partially ordered set (poset) 8 whose 
elements are called events. The events are non-reversible mutations with 
some constraints on their order of occurrence. Such constraints are primarily 
due to epistatic effects between different loci in a genome [Z]. The event 
constraints define the poset structure: ei < 62 in £" means that event ei 
must occur before event 62 can occur. Each genotype g is represented by a 
subset of namely, the set of all events that occurred to create g. Thus a 
genotype g is an order ideal in the poset £. The space of genotypes Q is the 
set of all order ideals in which is a distributive lattice |25[ Sec. 3.4]. The 
order relation on Q is set inclusion and corresponds to the accumulation 
of mutations. This mathematical formulation is reasonable in the above 
situations, where a population is exposed to strong selective pressure. 

The risk of escape is governed by the structure of the fitness function 
on Q, and the population dynamics (such as the mutation rates and pop- 
ulation size). Our focus is on the dependency of the risk of escape on the 
assigned fitness values for each genotype g & Q- This leads us to the risk 
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polynomial, which is shown to be equivalent to a well-known object in alge- 
braic combinatorics. Indeed, one of the objectives of this work is to provide 
a bridge between algebraic combinatorics and evolutionary biology. 

This paper is organized as follows. In Section |2l we formalize our model 
of a static fitness landscape on the genotype lattice Q derived from an event 
poset £, and we discuss evolution on the lattice Q. In SectionlHlwe review the 
multistate branching process studied by Iwasa, Michor and Nowak |14| I15j . 

In Section 0] we study the Bayesian networks which arise from identifying 
the events in £ with binary random variables. These statistical models 
can be used to infer the genotype space from given data. For conjunctive 
Bayesian networks we recover the distributive lattice of order ideals in Of 
particular interest is the case where £^ is a directed forest: here the Bayesian 
network is a mutagenetic tree model |31ll]- The application of our methods 
to the development of PI resistance in HIV is presented in Section |SJ 

The Appendix summarizes various representations of the risk polynomial 
in terms of structures from algebraic combinatorics. Efficient methods for 
computing the risk polynomial and their implementation are presented. 

2. Fitness landscapes on distributive lattices 

A partially ordered set (or poset) is a set £ together with a binary relation, 
denoted "<", which is reflexive, antisymmetric, and transitive. Here we fix a 
finite poset £ whose elements are called events. If the number of events is n 
then we often identify the set underlying £ with the set [n] = {1,2,..., n}. 
In this way, the subsets of £ are encoded by the 2" binary strings of length 
n. The empty subset of £ is encoded by the all-zero string = 00 • • • which 
represents the wild type, and the full set £ is encoded by the all-one string 
1 = 11 • • • 1 which represents the escape state. 

An order ideal g in a poset £" is a subset of £ that is closed downward; 
that is, if 62 € 5 and ei < 62, then ei S g. The set of all order ideals of £ 
forms a distributive lattice J{£) under inclusion. Birkhoff 's Representation 
Theorem [1251 Thm. 3.4.1] states that all distributive lattices have the form 
J{£) for a poset £. We write Q = J{£), and we call Q the genotype lattice. 

Example 1. Let £ be the trivial poset, where no two events are comparable, 
with \£\ = n. Then Q = J{£) is the Boolean lattice consisting of all subsets 
of £ ordered by inclusion. This means that all possible combinations of 
mutations are possible, and they can occur in any order. Each of the 2" 
binary strings g € {0, l}" represents a mutational pattern, or genotype. 

In general, the event poset £ does have non-trivial relations ei < 62. The 
relation ei < 62 excludes all genotypes g with g^^ = and ge2 = 1 from Q. 
The remaining genotypes g form a sublattice of the Boolean lattice {0, 1}", 
and this is precisely our distributive lattice G = J{£). Note that the lattice 
Q is ranked, with the rank function given by rank(5() = \g\. 
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Example 2. Consider a scenario with n = A mutation events, labeled 8 = 
{1,2,3,4}. Suppose that event 3 can only occur after events 1 and 2, and 
event 4 can only occur after event 2. This allows for precisely eight genotypes 

g = {0000,1000,0100,1100,0101,1110,1101,1111}. 

The event poset 8 and the genotype lattice Q are shown in Figure ^ 

A fitness landscape associates to each possible genotype a number which 
quantifies the reproductive capacity of an individual with that genotype 21 . 
We define a fitness landscape on the distributive lattice Q to be any function 

Q ^ M. The value i{g) at any G ^ is the fitness of the genotype 
g. Thus, the space of all fitness landscapes is the finite-dimensional vector 
space 

We shall consider certain special models of fitness landscapes, which are 
represented by linear subspaces of MP . In the following definitions, a geno- 
type g is regarded as a subset of the event poset where \8\ = n. A 
constant fitness landscape has the form ^{g) = a for some constant a. Thus 
the constant landscapes form a line through the origin in R^. A graded fit- 
ness landscape is a landscape on Q whose fitness values depend only on the 
rank. Equivalently, we have i{g) = a\g\ for constants oq, oi, . . . , a„. Thus, 
graded fitness landscapes form an (n-|- l)-dimensional linear subspace of M^. 

Our biological application in Section uses the graded fitness landscape 
model, which means that the fitness of a virus type depends only on the 
number of mutations it harbors. We shall model situations where a virus 
escapes from a wild type to a drug-resistant type 1. In this case, we assume 
a graded fitness landscape that is monotonically increasing with rank, i.e., 

ao < fll < 02 < • • • < On- 

This implies that the fitness landscape f has a unique local (and global) 
maximum at the drug resistant type 1, which is the top element in Q. 

We next introduce the mathematical framework for evolution on a fitness 
landscape. The general setup is as in the work of Reidys and Stadler [2Tj, 
but this is adapted here to our specific situation, where the genotypes form 
a distributive lattice Q. The order relation on Q, which comes from inclusion 
of subsets of induces a neighborhood structure on Q where the neighbors 
oi g ^ Q are the genotypes that strictly contain g, 

(1) N{g) := {heg \ gdh). 

Unlike the typical situation considered in .21^, this notion of neighborhood 
is not symmetric. To be precise, we have that h € N{g) implies g ^ N{h). 

This neighborhood structure implies that mutational changes are possible 
only upward in the genotype lattice. This structure models a directed evolu- 
tionary process from the wild type towards the escape state 1. Typically, 
our configuration space is a small subset of the Boolean lattice {0, 1}" of 
all binary strings. Indeed, in the course of viral evolution, a population will 
visit only a small fraction of {0, 1}", as most mutants are not viable. 
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Suppose that the number of genotypes in Q is m. We wish to define 
dynamics between the states of Q. To this end, we fix a hnear extension of 

and we introduce an mxm matrix of transition rates, written U = {ugh), 
whose rows and columns are indexed by genotypes g,h € Q. Each entry Ugh 
of the matrix U is a non- negative real number which is zero unless h € N(g). 
In the framework of algebraic combinatorics, it is convenient to think of the 
matrix U as an element in the incidence algebra of Q; see j251 Sec. 3.6]. 

We further assume that the non-zero mutation rates Ugh depend only on 
the events in h\g. Equivalently, the rate at which a collection of mutation 
events occurs is independent of which other mutations have already occurred. 
With this assumption, there are only n free parameters ^i,... ,;U„ in the 
matrix U, where He is the mutation rate of event e. Then 




Ueeh\g /^e if 5 C /l 

otherwise. 



In particular, if all rates are the same, say = /.ii = • • • = fin, then the 
entries of U are Ugh = /il'^^f I if g C h and Ug^ = otherwise. 

Example 3. For the genotype lattice G in Figure ^ the matrix U equals 
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Note that the entry in row g and column h of any power U equals Ug^ times 
the number of paths of length k from g to h in Q. In particular, = 0. 

Let f be a fitness landscape on Q and F = diag(f(g') | g G ^) the mxm 
diagonal matrix whose entries are the fitness values. The entry of the matrix 
product UF in row g and column h represents the probability of genotype g 
transitioning into genotype h in one step. A precise probabilistic derivation 
and interpretation will be given in the next section. 

We are interested in all mutational pathways that lead from the wild type 
to the escape state 1. Towards this end, note that the entry {g, h) of the 
matrix (UF)'^ represents the probability of genotype g evolving to genotype 
h along any mutational pathway (chain) of length k in the genotype lattice 
Q. The chains from to 1 in ^ are accounted for by the upper right hand 
entry of (UF)'^. Note that the matrix (UF)'^ is zero for k > n. 

To account for chains of arbitrary length, we consider the matrix 

(3) (I - UF)-^ - I = UF + (UF)2 + (UF)3 + • • • + (UF)'', 



6 



BEERENWINKEL, ERIKSSON, AND STURMFELS 



where I is the m x ni identity matrix. We summarize our discussion in the 
following proposition, which is proved by elementary matrix algebra. 

Proposition 4. The entry of the matrix ^ in row g and column h is zero 
unless g C h, in which case it is Ugh - ^{h) ■ Pgh{^) where Pgf^ is a polynomial 
function of degree \h\g\ — 1 on the space of all fitness landscapes M^. 

The polynomial ^^^(f) is the generating function for all chains from g 
to h in G- This will be made precise in the following corollary. We shall 
restrict ourselves to the most important case when g = is the wild type 
and /i = 1 is the escape state. Studying -Poi(f) only is no loss of generality 
because any interval of a distributive lattice is again a distributive lattice. 

Proposition m tells us that Poi(^) ^ polynomial of degree n — 1 in the 
unknown fitness values f{g), which are also written as fg, where g £ G- 

Corollary 5. The polynomial Poj(f) in the upper-right entry of ^ equals 

(4) ^'oi(f) = E faJa^-'-h.-., 

6=goCgiC---Cgfc=i 

where the sum runs over all chains from to 1 in the genotype lattice G- 

3. The risk of escape 

For a poset of events £ and the corresponding distributive lattice G = 
J{£), the risk polynomial of G is defined as the polynomial (jlj), which we 
denote by TZ{G',i)- The risk polynomial was introduced in |14| I15j . In 
this section we review the evolutionary dynamics model proposed in these 
papers, and we discuss the probabilistic meaning of the risk polynomial. 

Example 6. Let G be the genotype lattice in Figure ^ Then the risk poly- 
nomial TZ{G', f) is the following polynomial of degree three in six unknowns: 

1 + /looo + /oioo + /iioo + /oioi + /iiio + /iioi 
+/1000/1100 + /oioo/iioo + /oioo/oioi + /looo/iiio + /oioo/iiio 
+/1000/1101 + /oioo/iioi + /iioo/iiio + /iioo/iioi + /olOl/llOl 
+/1000/1100/1110 + /oloo/iioo/iiio + /looo/iioo/iioi 
+/0100/1100/1101 + /oloo/oioi/iioi- 

If we restrict the fitness landscape f to lie in a linear subspace of M^, then 
TZ{G', f ) specializes to a polynomial in fewer unknowns. For example, the risk 
polynomial for graded fitness landscapes is obtained from the specialization 
f{g) = a|g|. That risk polynomial has degree n — 1 and is denoted by 
TZ{G;ai, . . . ,an-i)- For instance, TZ{G',f) in Example IHl specializes to 

'R{G; oi, 02, as) = 1 + 2ai + 2a2 + 2a3 + 3aia2 + 4aia3 + 80203 + 5010203. 
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For constant fitness landscapes f = o , the risk polynomial is a polynomial 
in one unknown o. It is denoted TZ{Q]a). In our running example, 

7^(g; a) = 1 + 6a + lOa^ + 5a^. 

We now make precise the notion of risk of escape^ which will justify our 
definition of the risk polynomial. Our derivation is based on the model for 
the dynamics of a replicating population on a fitness landscape studied by 
Iwasa, Michor and Nowak jl4| I15j . See also the work of Wilke and the 
references given therein for approaches to computing fixation probabilities. 

A multistate branching process P consists of a set of genotypes along 
with a fitness landscape and mutation rates between genotypes. We assume 
a discrete time process, where in one generation an individual with genotype 
g has a random number of offspring following a Poisson distribution with 
mean Rg. Some of these offspring may be mutants according to the mutation 
rates Ugh- The parameter Rg is the basic reproductive ratio |191 Chap. 3]. 

We assume there is no interaction between individuals; each reproduces 
at a rate independent of the distribution of the population. Let ^ be the 
probability that one individual of genotype g has k children of type h. Then, 



(5) pIh 



_ {ughRgf ■ e-"^'-^^ 
'^''^ k\ 



The reproductive fitness fg is related to the reproductive ratio Rg by 
(6) fg = and Rg - 



Let be the probability of escape starting with one individual of genotype 
g, so 1— is the probability of extinction. In particular, is the probability 
that one resistant virus will not become extinct. Each of these probabilities 
is a function of the mutation rates Ugh and the reproductive ratios Rg. We 
assume that the Ugh are as in ((2)), but with Ugg = 1. Thus, each escape 
probability ^g can be expressed as a function of the //e for e G £^ and (using 
the relation the fitness values fgioig^Q. 

Theorem 7. If ^g <^ \ for g ^ 1, then the probability of escape on the 
fitness landscape f € starting with one individual of wild type 0, satisfies 

(7) Co « ei-/o-n'"<=-^(^;f)- 

Proof. The probability of extinction satisfies the recursive formula 

oo 

(8) = n E(i - ^'^)' • ^.v 

hDg k=0 
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Figure 2. Example of an event poset whose general risk 
polynomial is of degree 11 in 375 unknowns. 



Using the right hand side of ^ can be rewritten as follows: 



JJ exp((l - Ch)ughRg) ■ ex.p{-UghRg 



We conclude that 

loga-e^) 



ihUghRg 



exp j ^ -ihUghRg 



for all g G G- 



Under the assumption that .^^ <C 1 for g 1 , we can linearize the logarithms 
using the relation log(l — ^g) ^ — ^g. This implies, for g £ ^\{1}, 

^g ^ Rg ■ J2hDg ChUgh 

= 1-RgUgg ■ ^hDg ihUgh 
~ fg ' "^hDg ^hUgh- 

The theorem now follows by setting g = and expanding the last equation 
recursively. Here we are using the fact from @ that the product of the Ugh 
over any chain from to 1 in ^ equals HeGf f^e- D 

The typical situation of interest is a fitness landscape for which only the 
escape state has a basic reproductive ratio greater than one, i.e., 



Rl > 1 



and 



Rg<l 



for all 5 7^ i. 



When the positive numbers Rg are very small for g € ^\{1} then the approx- 
imation is valid, and it shows the crucial role that the risk polynomial 
7l{Q;i) plays in assessing the risk of escape from the wild type to the 
escape state 1. The theorem implies that the risk of escape of a population 
of N wild type viruses is (1 — ^q)^. In Section 1^1 we discuss the situation in 
which the population is not homogeneous at the time of intervention. 

The risk of escape is an important quantity in analyzing the invasiveness 
of pathogens and in assessing the success probability of medical interventions 
such as chemotherapy. However, putting this concept into practice depends 
on our ability to actually compute the risk polynomial. It turns out that 
methods from algebraic combinatorics lead to efficient algorithms for this 
task. In the Appendix, several methods are presented in detail. 
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Our method of choice from a practical perspective rehes on computing hn- 
ear extensions of the event poset E (Theorem 1151 Appendix). Our software 
implementation is available at http : / /bio .math. berkeley. edu/riskpoly7| 
. For an example of the efficiency of the software, let E be the poset in Fig- 
ure 121 on n = 12 events with cover relations z < 6 + z for 1 < i < 6 and 
i < 7+i for 1 < i < 5. Here the genotype lattice Q consists of 375 genotypes. 
The risk polynomial TZ{Q]i) is a polynomial of degree 11 in 375 unknowns 
fg. This polynomial has 224,750,298 monomials in the 375 unknowns, but 
we represent it as a sum of 2,702,765 products, one for each linear extension 
of the event poset £. Our software takes about ten seconds to compute this 
representation of 'R-{G'i f)- The result takes up 200MB of disk space. 

The univariate risk polynomial for this example is 

1 + 375a + 190880^ + 324498a^ + 2610169a^ + 11729394a^ + 32080336a^ + 
555979090^^ + 61448965a* + 42020208a^ + 16216590a^° + 2702765a". 

Thus, exact symbolic computations, as opposed to numerical approxima- 
tions, may be necessary and feasible when one is interested in assessing the 
risk of escape in applications like the one described in Section El below. 

4. Distributive lattices from Bayesian networks 

In this section, we present a family of statistical models that naturally 
gives rise to distributive lattices. This statistical interpretation provides a 
method for deriving the genotype lattice Q directly from data. The basic 
idea is to estimate the poset structure on £ from observed genotypes, by 
applying model selection techniques to a range of Bayesian networks, and to 
define Q as the set of all genotypes with non-zero probability in the model. 

We first make precise the derivation of a genotype space from a statis- 
tical model. Let £ be an unordered set of n genetic events. The events 
are labeled by 1,2, ... ,n. Subsets of £ are identified with binary strings 
g € {0, 1}"". They are the possible genotypes. We consider binary random 
variables = (Xi, . . . , where X^, = 1 indicates the occurrence of event 
e. Let A denote the (2" — l)-dimensional simplex of probability distributions 
on {0, 1}". A statistical model for X^- is a map p: O ^ A, where is some 
parameter space. The g-ih. coordinate of p, denoted pg, is the probability of 
genotype g G {0, 1}" under the model p. The induced genotype space of the 
model p: G ^ A is the set Qp of all strings g € {0, 1}" such that pg is not 
the zero function on 0. We regard Qp as a poset ordered by inclusion. 

Now consider a directed acyclic graph on the set of events £. We will also 
call this graph £. The Bayesian network model, or directed acyclic graphical 
model, defined by £ is the family of joint distributions that factor as 

Pr(Xi,...,X„) = []Pr(Xe lXp,(e)), 
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where pa(e) denotes the set of parents of e in Equivalently, a Bayesian net- 
work is specified by a set of conditional independence statements. Each node 
is independent of its ancestors given its parents. See JS] for an introduction 
to the relevant statistical theory and |13j for an algebraic perspective. 

The parameters for a Bayesian network are specified by providing, for 
each event e E a 2^^^^^^^ x 2 matrix 6^. The matrix entries are 



Set d = Eee£: 2'''''^''-" and 9 = [O,!]'^. The points in the cube 9 are 
identified with n-tuples of matrices 6* = (6*^ | e G <f ) as above. The general 
Bayesian network is the polynomial map p: Q ^ A whose coordinates are 



The general Bayesian network on £ induces the genotype space Qp = {0, 1}", 
the Boolean lattice on £. Indeed, the factorization (|lfl|) implies that no 
genotype g € {0, 1}" has probability zero for all parameter values. 

To obtain other genotype spaces, we replace the cube 9 = [0, 1]"^ by one 
of its faces, as follows. For each event e £ £ consider a Boolean function 
I3e : {0, IjP'^^^) ^ {0, 1}. If (3eige) = then the row of the 2lP^(^)l x 2-matrix 
6^ indexed by the genotype g is fixed to be the vector (1,0); otherwise that 
row remains indeterminate subject to the constraints Q. Let 9^^ denote the 
face of 9 determined by these requirements and : 9^ A the restriction 
of the polynomial map p to 9^. The resulting model is the Bayesian network 
on £ constrained by the Boolean functions P^. 

If all Boolean functions are disjunctions then we get the disjunctive 
Bayesian network on £. In this model, an event e can only occur if at least 
one of its parent events has already occurred. If all Boolean functions (3^ 
are conjunctions then we get the conjunctive Bayesian network on £. In 
this model, an event e can only occur if all of its parent events have already 
occurred. These restricted Bayesian network models induce interesting geno- 
type spaces. 9ur main result in this section concerns the conjunctive case. 

We regard the given directed acyclic graph £■ as a poset by setting ei < 62 
if there exists a path from ei to €2- We write p^°'^J : [0, 1]" A for the 
conjunctive Bayesian network on £, since it has precisely n free parameters. 

Theorem 8. The genotype space induced by the conjunctive Bayesian net- 
work on £ is the distributive lattice of order ideals in £, i.e., Qpcanj = J{£). 

Proof. The possible genotypes g are binary strings whose coordinates ge 
indicate whether or not the event e has occurred. If p is any of the Bayesian 
network models discussed above, then (|T(1|) implies that g £ Qp ii and only 




(10) 



9pa{e):5e ' 



EVOLUTION ON DISTRIBUTIVE LATTICES 



11 



if each dg^^^^^yg^ is non-zero. Consider now the conjunctive model p = p'^"^K 
Here, the conditional probability ^^p^j^j,^^ is non-zero if and only if = 1 
implies 5'pa(e) = (1; • • • > !)• This is precisely the condition for g to be an 
order ideal in £. Thus Qp is the distributive lattice of order ideals of f . □ 

The following example illustrates Theorem |H1 and it compares the geno- 
type spaces induced by the disjunctive and the conjunctive Bayesian net- 
work. The former is not a distributive lattice, but the latter always is. 

Example 9. Let £ be the event poset in Figure ^ The general Bayesian 
network model defined by £ is parametrized by the following four matrices: 

/ Coo 1 - Coo \ 

coi 1 - coi 

cio 1 - cio 

\ cn 1 - cii J 



-a ) , 



do 
di 



do 
di 



The map p: [0, 1]® — > A has coordinates 

Poooo = abcoodo, Poooi = 

Pooio = ab{l - coo)do, pooii = 

Poioo = a(l-6)coidi, poioi = 

Poiio = a{l - b){l - coi)di, poiii = 

Piooo = (1 - a)bciodo, Piooi = 

Pioio = (1 - a)ft(l - cio)do, pioii = 

Piioo = {I - a){l - b)cudi, puoi = 

Pino = (1 - a)(l - - cii)(ii, puu = 

This model induces the Boolean lattice {0, 1}^ as genotype space. 

The disjunctive Bayesian network is the six-dimensional submodel ob- 
tained by setting coo = 1 and do = 1- This substitution implies 

0. 



abcoo{l - do), 
ab{l - coo)(l - do), 
a{l - 6)coi(l - di), 
a(l-6)(l-coi)(l 
(1 - a)6cio(l - do), 
(l-a)6(l-cio)(l 
(l-a)(l-6)cii(l 



di), 

do), 
di). 



(l-a)(l-6)(l-cii)(l-(ii 



Poooi — POOlO 



Pooii — PlOOl 



PlOll 



The genotype space Gpdisj consists of the remaining eleven strings in {0, 1}^. 
Note that gpdisj is not a lattice because it is not closed under intersections. 
For instance, 1010 and 0110 are in ^^di^j but 0010 = 1010 n 0110 gpdisj. 

The conjunctive Bayesian network is the four-dimensional submodel ob- 
tained by setting coo = coi = cio = do = 1- The remaining eight non-zero 
probabilities are indexed by the eight genotypes in Figure ^ 

ab, poioo = a{l - b)di , 



Poooo = ab, Poioo 

Poioi = a(l -b){l- di), piooo 

Piioo = (1 - a)(l - 6)ciidi , piioi 

Pino = (1 - - - cu)di , puu 



{l-a)b, 

(l-a)(l-6)cii(l-di), 
(l_a)(l_5)(l_cii)(l-di). 



If £" is a directed forest, i.e., if every e € £ has at most one parent, then 
we can augment £" to a tree £'^ by adding an auxiliary root node which 
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points to the roots (edges with no parents) of the forest. On the resulting 
tree £^ we consider the mutagenetic tree model of . 

Proposition 10. If £ is a directed forest then the following three statistical 
models coincide: the disjunctive Bayesian network on £, the conjunctive 
Bayesian network on £, and the mutagenetic tree model on S'^ . 

Proof. The disjunctive and the conjunctive networks coincide because they 
are defined by the same speciahzations of the parameters 9'^. The identifi- 
cation with the mutagenetic tree model follows from Thm. 14.6]. □ 

Mutagenetic tree models can be learned from observed data by an efficient 
combinatorial algorithm. With appropriate edge weights that depend on the 
pairwise probabilities of events, a mutagenetic tree can be obtained as the 
maximum weight branching rooted at in the complete graph on {0, . . . , n}; 
see This gives an efficient method for learning the poset £, and hence 
the genotype lattice G = J{£), from data. It would be interesting to extend 
this model selection technique to arbitrary conjunctive Bayesian networks. 

5. Applications to HIV drug resistance 

We investigate the development of resistance during treatment of HIV 
infected patients with two different Pis. Consider the seven genetic events 

£ = {K20R, M36I, M46I, I54V, A71V, V82A, I84V} , 

where K20R stands for the amino acid change from lysine (K) to arginine (R) 
at position 20 of the protease chain, etc. The occurrence of these mutations 
confers broad cross-resistance to the entire class of Pis. Appearance of the 
virus with all 7 mutations renders most of the Pis ineffective for subsequent 
treatment. We analyze the risk of reaching this escape state under therapy 
with the Pis ritonavir (RTV) and indinavir (IDV) pilllTKj . 

We use mutagenetic trees for estimating preferred mutational pathways 
and for defining genotype lattices. For both drugs, a tree £'^ is learned from 
genotypes derived from patients under the respective therapy. We used 112 
and 691 samples from the Stanford HIV Drug Resistance Database |22| for 
ritonavir and indinavir, respectively. Figure |31 shows the inferred mutage- 
netic trees. The models indicate that the evolution of ritonavir resistance 
is partly a linear process, whereas indinavir resistance develops in a less 
ordered fashion. This is consistent with previous studies jl()|ll8j. The geno- 
type lattices G have size 16 for ritonavir and 45 for indinavir. We study the 
risk polynomials on these lattices under different fitness landscape models. 

For the constant fitness landscape on ^\{0, 1}, we obtain 

7^RTv(a) = 15o^ 70a^ 131o^ + 124a3 61a2 + 14a + 1, 
7^IDv(a) = 420a^ 1470a^ 1970a^ + 1250a3 + 372a2 43a 1. 

Thus, the risk of developing all seven PI resistance mutations is higher under 
indinavir therapy than under ritonavir: 7^idv(«) > '^RTv(a) for a > 0. 
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M36I 
\ 

K20R 
\ 

I84V A71V 
t \ 

M46I I54V K20R I54V A71V I84V 

\ / \ \ / \ 

V82A M36I V82A M46I 




(a) (b) 



Figure 3. Mutagenetic tree f ^ for the development of resis- 
tance to (a) ritonavir and (b) indinavir in the HIV-1 protease. 
The event poset £ is obtained by removing the root node "0" . 



Intuitively, the risk under ritonavir is lower because the mutations must 
occur in a certain order. Likewise, the high risk under indinavir results 
from many mutations occurring independently, which gives rise to a large 
genotype lattice and to many mutational pathways from the wild type to 
the escape state. 

More realistic fitness landscapes may be derived by modeling viral fitness 
as a function of drug concentration. We follow the approach pursued in [2^] 
and use a simple saturation function for this dependency. Specifically, we 
assume viral fitness to be the following function of drug concentration D, 

(11) fsiD) -- 



where (j)g denotes the fitness of genotype g in the absence of drug and rg 
the IC50 value of g, i.e., the drug concentration necessary to inhibit viral 
replication in vitro by 50%. The IC50 value is a measure of resistance. We 
will assume throughout that all (f)g = (p are equal. If we assume, in addition, 
that the resistance landscape is constant on ^\{0, 1}, with rg = r, then the 
substitution (|1H) turns the risk polynomial into a rational function in (p, D, 
and r. For example, for ritonavir, this rational function is 

(15(/)V2 + lOcpDr + lO0r2 + D'^ + 2Dr + r'^){(f>r + D + r)'^ 

{D + rf ■ 

In general, the IC50 values rg are distinct and can be determined experi- 
mentally for some genotypes by phenotypic resistance testing j28j . and may 
be predicted for all genotypes using regression techniques [2]. PI pheno- 
typic resistance data suggests a graded resistance landscape; see jH] and ^1 
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Figure 4. Graded resistance landscapes for ritonavir (RTV, 
bullets) and indinavir (IDV, squares). Resistance is quanti- 
fied as the drug concentration necessary to inhibit viral repli- 
cation in vitro by 50% (IC50). 



Tab. 3]. Hence, we estimate the resistance r € for ritonavir and indinavir 
by defining as the mean predicted IC50 of all genotypes of rank k. The 
resulting resistance landscapes are shown in Figure |3 

The graded risk polynomials 7^(ai, 02, 03, 04, 05, ae) have 64 terms. After 
substituting = 0/(1 + -D/r^), we obtain rational risk functions in D 
with parameter ^. Figure El illustrates the dependency of the risk on drug 
concentration for three different values of ^. For both drugs we indicate 
published mean plasma trough (Cmin) and peak (Cmax) levels observed in 
clinical settings. 

This example illustrates how the risk polynomial can be used to study 
viral escape as a function of different parameters. For instance, given a 
pharmacokinetics model of antiretroviral drug therapy, we can compute the 
risk of developing resistance after a patient has missed a dose. Thus, our 
mathematical framework may help in designing robust drug combinations. 
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Figure 5. Drug dependent risk. The log of the risk poly- 
nomial for ritonavir (a) and indinavir (b) is displayed as a 
function of plasma drug concentration D. Marked values de- 
note mean trough (Cmin) and peak (Cmax) levels observed in 
clinical studies. The parameter (j) is the relative fitness of 
mutants as compared to the wild type in the absence of drug. 



6. Discussion 

We have presented a computational framework for assessing the risk of 
escape of an evolving population of pathogens. The risk of escape is the 
probability that the population reaches an escape state before extinction. In 
virus transmissions, for example, this probability is the chance of survival in 
the new host. In the situation of antiretroviral therapy, the risk of escape is 
the probability of therapy failure due to the development of drug resistance. 

The general setup we consider for computing the risk of escape includes 
an event poset, a fitness landscape on its induced genotype lattice, and a 
branching process on this lattice. The event poset £■ consists of all muta- 
tional events that can occur and encodes the constraints which apply to their 
order of occurrence. Prom this structure the genotype space Q is obtained 
by considering all mutational pathways that respect the order constraints. 
This natural construction endows Q with the mathematical structure of a 
distributive lattice. The risk polynomial, the crucial factor in computing 
the risk of escape, turns out to coincide with the chain polynomial of the 
genotype lattice. We have presented methods from algebraic combinatorics 
that exploit this connection and that result in efficient algorithms. 

The space of genotypes may also be inferred from observed genotype data 
using statistical model selection tools. We have identified a class of Bayesian 
network models, the conjunctive Bayesian networks, whose support induces 
a genotype lattice. Mutagenetic tree models arise as important special cases. 
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Here, both statistical model selection and risk computation are particularly 
efficient, and readily available with existing software [H| coupled with our 
implementation of the linear extensions method (Theorem 1151 Appendix). 

We have focused on the dependency of the risk polynomial on the fitness 
landscape and considered throughout a homogeneous wild type population 
prior to intervention. However, the risk of escape is calculated similarly 
for a quasispecies distribution at the time of intervention. In fact, this 
involves computing the risk polynomial of the prior fitness landscape 514|. 
In contrast, the branching process model can not account for recombination, 
horizontal gene transfer, or frequency dependent selection, since evolution 
is assumed to take place in multiple lineages independently. 

The main challenge in using our method to compute the risk of escape 
from antiretroviral therapy lies in accurately modeling the fitness landscape. 
The dependency of the fitness on drug concentration may be improved 
by experimentally determined viral replicative capacities in the absence of 
drugs. An alternative approach to derive a fitness landscape for HIV-1 pro- 
teases is based on estimating the binding affinity of the drug to the mutant 
protease, and the mutant's ability to cleave its natural substrates |23| . These 
calculations are based on simplified molecular modeling techniques. The re- 
sulting fitness landscape does not account for different drug levels, but it is 
independent of experimental resistance and fitness data. 

Escape from indinavir and ritonavir therapy may in some cases involve 
mutations other than the seven we considered, although those are the most 
frequent mutations observed after therapy failure I18j . On the other 
hand, viral escape might be accomplished with genotypes that harbor fewer 
than all of the mutations. Thus it would be desirable to compute the risk of 
reaching any of several escape states, rather than only the 11 • • • 1 type. This 
computation will involve similar techniques to those presented in Section |21 
and the Appendix. 

Finally, the Pis form only one out of four distinct classes of antiretroviral 
drugs that are in current clinical use. The standard of care is combination 
therapy with at least three different drugs from two different drug classes. 
Modeling the fitness landscape of combination therapy in terms of viral drug 
resistance and drug exposure is even more challenging, but can eventually 
help in designing optimal antiretroviral therapies. Algebraic combinatorics 
offers tools for the mathematical analysis of these biomedical problems. 
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Appendix: Mathematics and computation of the risk polynomial 

Here we discuss in more detail mathematical properties of the risk poly- 
nomial and we present several methods for computing it. The given data 
consists of an n element poset £ and its induced genotype lattice Q, which 
is the distributive lattice of order ideals in £. We assume that Q has m 
elements, which are encoded either as subsets of £ or as binary strings in 
{0, 1}"". The risk polynomial is the polynomial TZiG', f) in the m unknowns 
fg = {{g), one for each genotype g. We are also interested in specializations 
of TZ{Q; f) obtained by setting some (or all) of the unknowns equal to each 
other, such as the graded risk polynomial and the univariate risk polynomial. 

Stanley's linear algebra method. A direct method for computing the 
risk polynomial is given in Section 1^1 Namely, we can set all He equal to 
one in the matrix U and then compute the upper right entry of the matrix 
(I — UF)~^ — I of equation Q- In practice, one would compute this entry 
by a dynamic program which runs in time 0{m?). That dynamic program 
is easily derived by resolving the recursion in the last equation of the proof 
of Theorem [7J 

The following alternative linear algebra technique for computing poly- 
nomials similar to our risk polynomials was given by Stanley in 24 . Let 
Q' = t/\{0,l} denote the genotype lattice with the top element 1 and the 
bottom element removed. We define A to be the anti- adjacency matrix of 
the truncated genotype lattice Q' . Thus A is the (m — 2) x (m — 2)-matrix 
with rows and columns indexed by and whose entry in row g and column 
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/i is if g C /i and is 1 otherwise. We write I for the (m — 2) x (m — 2) iden- 
tity matrix and F' = diag( f (gf) | g € Q'^ for the (m — 2) x (m — 2)-diagonal 
matrix whose entries are the fitness values. Stanley's result reads as follows. 

Theorem 11 (Stanley [^). The risk polynomial lZ{Q]i) equals the deter- 
minant of the [m — 2) X (m — 2) -matrix I + F' • A. 

Example 12. Let Q be the genotype lattice in Figure ^ Then m = 8 and 
I + F' • A is the 6 x 6-matrix 





1000 


0100 


1100 


0101 


1110 


1101 


1000 


1 + /lOOO 


/lOOO 





/lOOO 





\ 


0100 


/oloo 


1 + /oloo 














1100 


/llOO 


/llOO 


1 + /llOO 


/llOO 








0101 


/oioi 


/oloi 


/oloi 


1 + /oloi 


/oioi 





1110 


/lllO 


/lllO 


/lllO 


/lllO 


1 + /mo 


/lllO 


1101 


\ /llOl 


/llOl 


/llOl 


/llOl 


/llOl 


1 + /llOl / 



The determinant of this matrix is the risk polynomial of Example El 

The Hilbert series method. A more conceptual way of thinking about 
the risk polynomial is based on the following algebraic construction. The 
Stanley- Reisner ideal Igi of Q' is the ideal generated by all quadratic mono- 
mials fg-fh where g and h are genotypes that are incomparable, i.e., neither 
g h nor h Q g holds. The ambient polynomial ring S = M[f] is gener- 
ated by the unknowns fg where g (z Q' . The Hilbert series of Ig' is the 
formal sum over all monomials f" = Ilggg' fg^ which are not in the ideal 
Igr. This is a formal generating function which can be written as a rational 
function of the following form 

H{S/Igr,f) . ^^(f) 



Ug^G'i^-fg) 

Here Kg{{) is a polynomial in the unknowns fg with integer coefficients. The 
polynomial Kg(i) is known as the K-polynomial of the ideal Ig'. We refer to 
|17j for an introduction to Stanley- Reisner ideals and their K-polynomials. 

If £^ is a directed forest (and we identify fg = pg) then Proposition [Till and 
[3 Thm. 14.11] imply that the ideal Ig> is an initial monomial ideal of the 
conjunctive Bayesian network on £. In a forthcoming paper we shall prove 
that this initial ideal property holds for all event posets (not just trees). 

Example 13. Let Q be the genotype lattice in Figure Then 

k' = (/oioi/iiio, /iioi/iiiO) /oioi/iioO) /oioi/iooO) /oioo/iooo) 
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is indeed the initial monomial ideal of the conjunctive Bayesian network in 
Example El The K-polynomial Kg{{) equals 

1 — /oioi/iiio — /iioi/iiio — /oioi/iioo — /oioi/iooo — /oioo/iooo 
+/0100/1000/0101 + /looo/oioi/iioo + /looo/oioi/iiio + /oloi/iioo/iiio 
+/0101/1110/1101 + /oloo/iooo/iiio/iioi 
~/iooo/oioi/iioo/iiio — /oloo/iooo/oioi/iiio/iioi- 

Again using Proposition EH and Theorem 14.11 in [2] we see that the risk 
polynomial TZiG', f) is the sum of all squarefree monomials in the expansion 
of the Hilbert series H{S/Igi;i). Equivalently, TZ{G;i) is the reduction of 
H{S/Igi] f) modulo the ideal generated by the squares fg of the unknowns. 
Since 1/(1 — /g) equals 1 + modulo ( ), we have the following result. 

Proposition 14. The risk polynomial TZ{Q;f) of the genotype lattice Q is 
the sum of all squarefree terms in the expansion of 

Kg{t)-l[{l + fg), 

where Kg{{) is the K-polynomial of the Stanley- Reisner ideal Igi . 

The univariate risk polynomial 7^(^; a) is derived from f ) by re- 

placing each fg by the scalar unknown a. We have 

TZ{Q\a) = Co + cia + C2a^ H hCn-ia""^, 

where Cj is the number of chains of length i in Q' . Thus, (cq, . . . ,c„_i) is 
the /-vector of the simplicial complex of chains in Q' . Likewise, we get the 
graded risk polynomial from lZ{Q;i) by replacing each fg by a\g\. We note 
that the graded risk polynomial is related to Ehrenborg's quasi-symmetric 
function encoding of the flag /-vector of the chain complex of Q' . 

The linear extensions method. One advantage of both Theorem II II and 
Proposition El is that these formulas do not actually depend on the fact 
that ^ is a distributive lattice. They also apply if the set Q of genotypes 
is an arbitrary poset. This is relevant for our discussion of the statistical 
models in Section where we introduced a more general class of posets 

^pC{o,ir. 

This advantage is also a disadvantage: Theorem ^2 and Proposition El 
do not give the most efficient methods for computing TZ{Q;i) when Q is 
the distributive lattice induced by an event poset 8. In what follows we 
present a specialized and more efficient algorithm for the risk polynomial. 
The input to this algorithm consists of the event poset £. It is not necessary 
to compute the genotype lattice Q as this will be done as a byproduct of our 
approach, which is to compute the risk polynomial TZ{Q; f) directly from £. 

As before, we assume that £ has n elements, and we write [n] for the lin- 
early ordered set {1,2,..., n}. A linear extension of £ is an order-preserving 
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bijection tt: 8 ^ [n]. This means that e < e' in f imphes 7r(e) < vr(e'). Ev- 
ery hnear extension vr : f — s- [n] gives rise to an ordered hst of n— 1 genotypes 
c/(i),5(2)^ . . . ^(,("-1) in g' = ^\{6,i} as follows. The genotype is the 
subset of 8 consisting of all events whose image under tt is among the first 
i positive integers. In symbols, ^^^^ = 7r~^({l, 2, . . . , i}). The sequence 
g^^\g^'^\ ■ ■ ■ ,g^^~^\ derived from vr, represents a mutational pathway in Q. 

We now fix one distinguished linear extension of 8, that is, we identify 
the set underlying 8 with [n] itself. Then a linear extension is simply any 
permutation tt of [n] which preserves the order relations in 8. We define 

(12) f(7r) = n n 

j:7r(i)<7r(j+l) i:7r(j)>7r(i+l) 

where i runs over {1,2, . . . ,n — 1}. Our algorithm amounts to evaluating 
the risk polynomial by means of the following explicit summation formula. 

Theorem 15. The risk polynomial TZ{Q; f ) equals the sum of the products 
f(7r) where vr runs over all linear extensions of the event poset 8. 

Proof. The relationship between chains in Q and linear extensions of 8 is 
the content of [22, Prop. 3.5.2]. The distributive lattice G has a canonical 
R-laheling [23 Sec. 3.13] which assigns to each edge of the Hasse diagram of 
Q the corresponding element of 8. In view of this R-labeling, Exercise 59d 
in |251 Chap. 3] tells us that the poset Q' = ^\{0, 1} is chain-partitionable. 
Each product f(7r) as in (|12|) is the generating function for all the chains in 
precisely one part of that chain partition of Q'. Adding up all products gives 
the generating function for all chains, which is the risk polynomial. □ 

Example 16. The event poset 8 in Figure ^ has five linear extensions vr: 

vr f (vr) 

(1, 2, 3, 4) (1 + /iooo)(l + /iioo)(l + /mo) 

(1.2.4.3) (l + /iooo)(l + /iioo)/iioi 

(2.1.3.4) /oioo(l + /iioo)(l + /iiio) 
(2,1,4,3) /oioo(l + /iioo)/iioi 
(2,4,1,3) (l + /oioo)/oioi(l + /iioi) 

The sum of these five products equals the risk polynomial TZ{G; f). 

Implementation. Pruesse and Ruskey [20] showed that the linear exten- 
sions of a poset 8 can be computed in time linear in the number of linear 
extensions. Thus, their algorithm computes TZ{Q;{) in time linear in the 
size of the output of Theorem 1151 That output is in factored form (|12j) and 
is always more compact than the expanded risk polynomial. In this manner, 
we compute the risk polynomial in time sublinear in the size of the expanded 
risk polynomial. 

To obtain the univariate risk polynomial, we take the sum of the terms 
(1 + a)"'~^~^a^ , where 6 = <5(vr) is the number of descents of the linear 
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extension vr. Similarly, the graded risk polynomial 71{G; ai, . . . , an~i) is 
found by keeping track of the descent set of each linear extension vr. We 
believe that this method is best possible for general posets £. Notice that 
the leading term of the univariate risk polynomial is the number of linear 
extensions of £, and it is ^^P-complete to count linear extensions [Hj. 

When £ is a directed forest, the recursive structure can be used to help 
compute the risk polynomial. In this case, £ is built up by the operations 
of disjoint union and ordinal sum from the one element poset. For exam- 
ple, in the univariate case, the zeta polynomial |25| Sec. 3.11] of Q behaves 
nicely under these operations and can be used to write down the risk poly- 
nomial. Based on these considerations, we can design an efficient algorithm 
for computing the univariate risk polynomial of a directed forest. 

Using the method of Theorem 1151 we have developed software for com- 
puting risk polynomials. The input to our program is an arbitrary event 
poset £, and the output is the risk polynomial, the graded risk polynomial 
or the univariate risk polynomial. Optionally, the user can also input either 
exact fitness values or upper and lower bounds for each fitness value. The 
output in this case is either the exact risk of escape or upper and lower 
bounds for the risk. It is designed to integrate with the package Mtreemix 
[Sj, allowing the user to start with data, infer a mutagenetic tree, and then 
easily compute the risk polynomial. Our software is available at 



We use the algorithm of |2jl for computing linear extensions. Although 
this algorithm isn't asymptotically optimal, as shown in [20], it is simple to 
implement and efficient in practice. 




